Prediction Methods

Author
Affiliation

Dave Clark

Binghamton University

Published

February 5, 2025

Prediction Methods

These slides illustrate different methods of generating predictions from multivariate models. Generating quantities of interest from models is a key part of the modeling process. We can use these predictions to understand the model, and to evaluate counterfactuals.

Quantities of interest generally include a prediction of \(y\) for a given set of \(x\) values, and the uncertainty around that prediction. We can also compute the marginal effects of \(x\) on \(y\), and the uncertainty around those marginal effects.

Here, we’ll examine three methods:

  • At-means predictions (adjusted effects)
  • Average effects
  • Simulated effects

These methods are all flexible and adaptable, and are what we commonly use in linear (OLS) and non-linear (MLE) models. So you’ll see these methods quite in quantitative social science modeling.

Major League Baseball Attendance

We’ll use a dataset of Major League Baseball attendance to illustrate different methods of generating predictions from multivariate models.

code
mlb <- read.csv("/users/dave/documents/teaching/501/2023/slides/L3_multivariate/code/mlbattendance/MLBattend.csv")  

# rescale home attendance
mlb$Home_attend <- mlb$Home_attend/1000

datasummary(All(mlb) ~ mean+ median+ sd+min+max,  data=mlb, style="grid", title = "MLB attendance data")
MLB attendance data
mean median sd min max
Season 1985.06 1985.00 9.27 1969.00 2000.00
Home_attend 1777.99 1681.90 755.87 306.76 4483.35
Runs_scored 694.94 691.50 105.17 329.00 1009.00
Runs_allowed 694.89 693.00 105.52 331.00 1103.00
Wins 78.85 79.00 12.67 37.00 114.00
Losses 78.88 79.00 12.65 40.00 110.00
Games_behind 14.39 13.00 11.75 0.00 52.00

Bivariate model, predictions

Here’s a simple model predicting season attendance at MLB games, regressing home attendance on runs scored, then generating “in-sample” predictions by computing:

\[\widehat{y} = \widehat{\beta}_0 + \widehat{\beta}_1 x_i\]

where \(x_i\) is the number of runs scored for each row/observation in the estimation data.

code
m1 <- lm(data=mlb, Home_attend ~  Runs_scored)
#modelsummary(m1)

stargazer(m1, type="html", title="Bivariate model of MLB attendance")
Bivariate model of MLB attendance
Dependent variable:
Home_attend
Runs_scored 3.554***
(0.216)
Constant -691.488***
(151.853)
Observations 838
R2 0.244
Adjusted R2 0.244
Residual Std. Error 657.405 (df = 836)
F Statistic 270.514*** (df = 1; 836)
Note: p<0.1; p<0.05; p<0.01
code
p1 <- data.frame(predict(m1), mlb$Runs_scored)

ggplot(p1, aes(x=mlb.Runs_scored, y=predict.m1.)) + 
  geom_line() +
  xlab("Runs Scored") +
  ylab("Predicted Attendance (in thousands)")   

Multivariate model, in-sample predictions

Here’s a model of attendance at MLB games, regressing home attendance on runs scored, runs allowed, season (year), and games behind, then generating “in-sample” predictions.

code
m3 <- lm(data=mlb, Home_attend ~  Runs_scored + Runs_allowed+ Season +Games_behind)
# modelsummary(m3)

stargazer(m3, type="html", title="Multivariate model of MLB attendance")
Multivariate model of MLB attendance
Dependent variable:
Home_attend
Runs_scored 3.123***
(0.338)
Runs_allowed -1.887***
(0.357)
Season 36.215***
(2.284)
Games_behind -8.273***
(2.755)
Constant -70,849.850***
(4,473.400)
Observations 838
R2 0.471
Adjusted R2 0.469
Residual Std. Error 550.845 (df = 833)
F Statistic 185.758*** (df = 4; 833)
Note: p<0.1; p<0.05; p<0.01
code
p3 <- data.frame(predict(m3), mlb$Runs_scored)
ggplot(p3, aes(x=mlb.Runs_scored, y=predict.m3.)) + 
  geom_line() +
  xlab("Runs Scored") +
  ylab("Predicted Attendance (in thousands)")   

As you can see, the predictions are pretty ugly because all four \(x\) variables are varying by observation, so we’re getting predictions based on each individual observation’s characteristics - interesting if we care about the 1977 Red Sox, but not something we can draw generalities from. What’s missing here?

Control

Hold all variables constant at means, medians, or modes, except the one of interest, so we can focus on how changes in the \(x\) of interest affect \(\widehat{y}\).

Prediction Methods for Multivariate Models

There are lots of ways to generate predictions - this covers three common and flexible types. In general, these are “out of sample” methods - they are often not relying on the estimation data, but on new data intended to represent key features of the estimation data. They all produce predictions for the number of “interesting values” on your \(x\) variable of interest. So they generally do not produce \(N\) predictions.

Prediction methods

These three approaches are all very flexible/adaptable, and are what we commonly use in linear (OLS) and non-linear (MLE) models. So you’ll see these methods quite in quantitative social science modeling.

At-means predictions (adjusted effects)

At-means predictions (also called “adjusted effects”) set all variables except the \(x\) of interest at a sensible value - usually the mean, median, or mode depending on the level of measurement. Holding those constant, but varying just the \(x\) of interest produces predictions of \(y\) that are neater and easier to discuss than the in-sample ones above.

Summarize the estimation data

Let’s find the means etc. of the variables in the model. It’s important only to consider the cases that are in the estimation data, that is, actually used in the model. Write code to identify cases in the model, then use the means, etc. of these for the predictions:

mlb$used <- TRUE
mlb$used[na.action(m3)] <- FALSE
estdata <- mlb %>%  filter(used=="TRUE")
code
mlb$used <- TRUE
mlb$used[na.action(m3)] <- FALSE
estdata <- mlb %>%  filter(used=="TRUE")

datasummary(Home_attend+Runs_scored +Runs_allowed + Season + Games_behind ~ mean+ median+ sd+min+max,  data=estdata, style="grid", title = "MLB estimation data")
MLB estimation data
mean median sd min max
Home_attend 1777.99 1681.90 755.87 306.76 4483.35
Runs_scored 694.94 691.50 105.17 329.00 1009.00
Runs_allowed 694.89 693.00 105.52 331.00 1103.00
Season 1985.06 1985.00 9.27 1969.00 2000.00
Games_behind 14.39 13.00 11.75 0.00 52.00

In this case, it turns out we use all the cases in the data - it’s important to check this any time you’re making model predictions.

Generate at-mean predictions

Create a new data frame with as many observations as there are interesting values of the \(x\) variable of interest. Then, set all variables except the \(x\) of interest at their means, medians, or modes. Using the standard errors of the predictions, generate the boundaries of the confidence intervals by end point transformation.

\[ \widehat{y} \pm 1.96 \times se(\widehat{y}) \]

code
oosdata <- data.frame(Intercept=1, Runs_allowed=694 , Season=1985 , Games_behind=14 , Runs_scored= c(seq(330,1000,10)))

mlb.predict <- data.frame(oosdata, predict(m3, interval="confidence", se.fit=TRUE, level=.05, newdata=oosdata))

# confidence bounds by end point transformation
mlb.predict <- mlb.predict %>% mutate(ub=fit.fit+1.96*se.fit) %>% mutate(lb=fit.fit-1.96*se.fit)

atmean <- ggplot(mlb.predict, aes(x=Runs_scored, y=fit.fit)) + 
  geom_line() + 
  geom_ribbon(aes(ymin=lb, ymax=ub), alpha=.2) + 
  xlab("Runs Scored") + 
  ylab("Expected Attendance (in thousands)") + 
  theme_minimal() +
  ggtitle("At-mean effects")

atmean

R’s predict() function does this for us, but it’s important to understand the process. The predict() function generates the predictions and the standard errors of those predictions. The interval="confidence" option tells R to generate the confidence intervals, and the se.fit=TRUE option tells R to generate the standard errors of the predictions.

Aside on standard errors of \(\widehat{y}\)

The standard errors of \(\widehat{y}\) are calculated using the \(x\) values, and the variance-covariance matrix of the coefficients. The maximum likelihood method is as follows:

\[ \sqrt{diag(\mathbf{X V X'})} \]

Where \(X\) is the matrix of \(x\) values (probably the constructed, out of sample \(X\) matrix), \(V\) is the variance-covariance matrix of \(\widehat{\beta}\), and \(X'\) is the transpose of the \(X\) matrix. The standard errors of \(\widehat{y}\) are calculated by taking the square root of the main diagonal of this matrix.

Uncertainty about \(\widehat{y}\)

The intuition is that we’re using our uncertainty about the coefficients to generate measures of uncertainty about the predictions.

Average effects, end point boundaries

Average effects are where we’re computing \(N\) predictions for every interesting value of \(x\), then taking the average of those predictions for each value of \(x\). This is a counterfactual approach - we’re using the actual data, changing only the variable of interest, as if all observations took on the same value of that variable.

For instance, in our MLB attendance model, we change the value of “Runs Scored” to the minimum value, 329, for every observation, keeping all the other variables as they are in the real estimation data. Compute the predictions for all the observations at \(x = 329\), then take the average of those predictions - now iterate to 330, 331, etc.

Counterfactuals

We are treating the estimation data as if every team scored 329 runs - this is the counterfactual - then computing the average attendance for that counterfactual, and doing this for all counterfactuals (values of \(x\)).

It’s important to note we are using the estimation data to compute average effects, looping over values of the variable of interest, then saving the average (median) prediction and the median standard error of that prediction.

code
# preserve the original values of Runs Scored
mlb <- mlb%>%
mutate(original_runs_scored = Runs_scored)

xb = 0
se = 0
rs = 0

for(i in seq(330,1000,1)){  #iterating by 1 to max number of obs we're taking medians for
  mlb$Runs_scored <- i
  mlb.predict <- data.frame(predict(m3, interval="confidence", se.fit=TRUE, level=.05, newdata=mlb))
  xb[i-329] <- median(mlb.predict$fit.fit) #index using i but minus constant to start at row 1
  se[i-329] <- median(mlb.predict$se.fit)
  rs[i-329] <- i
}

avg.pred <- data.frame(xb, se, rs)

# upper and lower bounds by end point transformation

avg.pred <- avg.pred %>% mutate(ub=xb+1.96*se) %>% mutate(lb=xb-1.96*se)

# reset the estimation data to the actual values of Runs Scored
mlb <- mlb%>%
  mutate(Runs_scored=original_runs_scored )

#plot
average <- ggplot(avg.pred, aes(x=rs, y=xb)) + 
  geom_line() + 
  geom_ribbon(aes(ymin=lb, ymax=ub), alpha=.2) + 
  xlab("Runs Scored") + 
  ylab("Expected Attendance (in thousands)") +
  ggtitle("Average Effects") +
  theme_minimal()

average

Comparing at-mean and average effects

code
atmean + average

Simulated effects

The last method we’ll consider is simulated effects. This is a way to generate predictions that are based on the distribution of the coefficients. The \(\widehat{\beta}\)s are estimates of the mean of a normal distribution, and the var-cov matrix of the coefficients is an estimate of the variance of that distribution. We can assume the distribution of \(\beta\) is Normal because of the Central Limit Theorem.

Here’s the process. Treat our \(\widehat{\beta}\)s as the mean of a normal distribution, and the var-cov matrix of the coefficients as the variance. Generate a large number of draws from that distribution, say 10,000. This will give us the simulated distribution of \(\widehat{\beta}\). We’ll have 10,000 estimates of \(\widehat{\beta}\).

Using these, we can now produce 10,000 predictions for each interesting value of \(x\). We can then summarize the distribution of those predictions, using the median as our point estimate, and the percentiles (2.5, 97.5) as our confidence boundaries.

code
B <- data.frame(rmvnorm(n=10000, mean=coef(m3), vcov(m3))) #simulated distribution of the estimates using the var-cov matrix of B as the variance, and the estimates of B as the mean.

colnames(B) <-c('b0', 'b1', 'b2', 'b3', 'b4')

sim.preds <- data.frame ( lb = numeric(0),med= numeric(0),
                            ub= numeric(0), r = numeric(0))

for(i in seq(330,1000,1)){
  xbR  <- quantile(B$b0 + B$b1*i + B$b2*694 + B$b3*1985 +B$b4*14, probs=c(.025,.5,.975))
  xbR<- data.frame(t(xbR))
  sim.preds[(i-329):(i-329),1:4] <- data.frame(xbR, i)
}

#plot
sim <- ggplot(sim.preds, aes(x=r, y=med)) + 
  geom_line() + 
  geom_ribbon(aes(ymin=lb, ymax=ub), alpha=.2) + 
  xlab("Runs Scored") + 
  ylab("Expected Attendance (in thousands)") +
  ggtitle("Simulated Effects") +
  theme_minimal()

sim

Comparing methods

code
(atmean + average + sim)

Summary

These look quite similar, so what’s the point of these different approaches? Different data, models, and questions will sometimes point you toward a particular method. For instance, fixed effects models will likely push you to average effects. We’ll encounter a variety of circumstances that will make one or another of these methods more or less appropriate both in this class and in the non linear models class.

Back to top